This workshop is mainly adapted from the mixOmics tutorials (http://mixomics.org) and the Bioconductor vignette (https://bioconductor.org/packages/release/bioc/html/mixOmics.html).
This session introduces dimensionality reduction methods for two matched (two-omic) datasets obtained from the same samples.
The main purpose of Canonical Correlations Analysis (CCA) is the exploration of sample correlations between two sets of variables observed on the same individuals whose roles in the analysis are strictly symmetric.
Regularised canonical correlation analysis (rCCA) is a statistical method that extends canonical correlation analysis (CCA) by adding a regularisation term to the objective function. Regularisation helps to prevent overfitting, which is a problem that can occur when a model is too complex and learns the noise in the data instead of the true underlying relationships.
rCCA can be used to find linear relationships between two sets of variables, even when the data is noisy or has a large number of variables.
This session will also cover projection to latent structures (PLS) regression, which builds components from two datasets to maximise covariance.
library(mixOmics)
## Loading required package: MASS
## Loading required package: lattice
## Loading required package: ggplot2
##
## Loaded mixOmics 6.24.0
## Thank you for using mixOmics!
## Tutorials: http://mixomics.org
## Bookdown vignette: https://mixomicsteam.github.io/Bookdown
## Questions, issues: Follow the prompts at http://mixomics.org/contact-us
## Cite us: citation('mixOmics')
data(breast.TCGA)
# Extract training data and name each data frame
# There are 3 dataframes (3 omics of the same samples), stored as list
# In each dataframe, the rows represent samples.
# Columns represent feature measurments (genes, miRNA, protein)
X <- list(miRNA = breast.TCGA$data.train$mirna,
mRNA = breast.TCGA$data.train$mrna,
protein = breast.TCGA$data.train$protein)
Y <- breast.TCGA$data.train$subtype
table(Y)
## Y
## Basal Her2 LumA
## 45 30 75
We can use imgCor() to view the raw correlation between sets of variables.
In this example x11() is used to capture the image in an XQuartz window, as a workaround to limited viewing space in RStudio. It is not a necessary plot.
# produce a heat map of the cross correlation matrix
x11()
imgCor(X$miRNA, X$mRNA,
type = "combine", keysize = 0.05,
sideColors = color.mixo(4:5), row.cex = 0.25, col.cex = 0.5)
savePlot("correlationPlot.miRNA.mRNA.png", type = "png")
imgCor(X$mRNA, X$protein,
type = "combine", keysize = 0.05,
sideColors = color.mixo(5:6), row.cex = 0.25, col.cex = 0.5)
savePlot("correlationPlot.mRNA.protein.png", type = "png")
imgCor(X$mRNA, X$protein,
type = "combine", keysize = 0.05,
sideColors = color.mixo(c(4,6)), row.cex = 0.25, col.cex = 0.5)
savePlot("correlationPlot.miRNA.protein.png", type = "png")
mixOmics implements rCCA with (a) the Cross-Validated Ridge or (b) Shrinkage procedures to overcome overfitting, or finding false patterns/correlations from many comparisons. The cv ridge approach adds a penalty term of rCCA to prevent overfitting, whereas the shrinkage approach reduces estimates of the canonical coefficients towards zero. Generally CV-Ridge does not shrink the coefficients towards zero as much as the shrinkage method. The shrinkage method is more effective at preventing overfitting, but it also makes the model less powerful. CV-Ridge is computationally much slower than Shrinkage and may not work for a large amount of variables (>5000), but has advantages in preserving the power of the model.
# set grid search values for each regularisation parameter
grid1 <- seq(0.001, 0.2, length = 15)
grid2 <- seq(0.001, 0.2, length = 15)
# (slow step)
cv.tune.rcc.mRNA.protein <- tune.rcc(X$mRNA, X$protein, grid1 = grid1, grid2 = grid2,
validation = "Mfold", folds = 6, plot = TRUE)
saveRDS(cv.tune.rcc.mRNA.protein, file = "cv.tune.rcc.mRNA.protein.rds")
Note that this step can be time-consuming. The output object can be saved as an rds file, which you can load into a later session.
cv.tune.rcc.mRNA.protein <- readRDS("cv.tune.rcc.mRNA.protein.rds")
Examine output of cross-validation tuning.
cv.tune.rcc.mRNA.protein # examine the results of CV tuning
##
## Call:
## tune.rcc(X = X$mRNA, Y = X$protein, grid1 = grid1, grid2 = grid2, validation = "Mfold", folds = 6, plot = TRUE)
##
## lambda1 = 0.1573571 , see object$opt.lambda1
## lambda2 = 0.01521429 , see object$opt.lambda2
## CV-score = 0.814176 , see object$opt.score
Get the optimised L1, L2
opt.L1 <- cv.tune.rcc.mRNA.protein$opt.lambda1 # extract the optimal lambda values
opt.L2 <- cv.tune.rcc.mRNA.protein$opt.lambda2
# formed optimised CV rCCA
cv.rcc.mRNA.protein <- rcc(X$mRNA, X$protein, method = "ridge",
lambda1 = opt.L1, lambda2 = opt.L2)
shrink.rcc.mRNA.protein <- rcc(X$mRNA,X$protein, method = 'shrinkage')
# examine the optimal lambda values after shrinkage
shrink.rcc.mRNA.protein$lambda
## lambda1 lambda2
## 0.1210357 0.1504092
From this point onwards, the steps are the same for CV-Ridge and Shrinkage rCCA models. For simplicity, we’ll carry on mainly with the results obtained using the Shrinkage procedure.
# barplot of shrinkage method rCCA canonical correlations
plot(shrink.rcc.mRNA.protein, type = "barplot", main = "Shrinkage")
The plotIndiv() function in this context can display the the samples in “XY-space” (X and Y referring to the two datasets, X$mRNA and X$protein).
# plot the projection of samples for CV-Ridge rCCA data
plotIndiv(cv.rcc.mRNA.protein, comp = 1:2,
ind.names = row.names(X$mRNA),
group = Y, rep.space = "XY-variate",
legend = TRUE, title = 'rCCA CV-Ridge XY-space', legend.title = "Cancer\nSubtype")
# plot the projection of samples for CV-Ridge rCCA data
plotIndiv(shrink.rcc.mRNA.protein, comp = 1:2,
ind.names = row.names(X$mRNA),
group = Y, rep.space = "XY-variate",
legend = TRUE, title = 'rCCA Shrinkage XY-space', legend.title = "Cancer\nSubtype")
# arrowplot
plotArrow(shrink.rcc.mRNA.protein, group = Y,
col.per.group = color.mixo(1:3),
title = 'rCCA', legend.title = "Cancer\nSubtype")
plotVar(shrink.rcc.mRNA.protein, var.names = c(FALSE, FALSE),
cex = c(4, 4), cutoff = 0.5, col = c("black", "black"), overlap = FALSE,
title = 'rCCA mRNA + protein')
network(shrink.rcc.mRNA.protein, comp = 1:2, interactive = FALSE,
lwd.edge = 2,graph.scale = 0.25,
cutoff = 0.5, color.node = color.mixo(5:6),
save = "png", name.save = "network")
# focus on component 1
network(shrink.rcc.mRNA.protein, comp = 1, interactive = FALSE,
lwd.edge = 2,graph.scale = 0.4,
cutoff = 0.6, color.node = color.mixo(5:6),
save = "png", name.save = "network.comp1")
# focus on component 2
network(shrink.rcc.mRNA.protein, comp = 2, interactive = FALSE,
lwd.edge = 2,graph.scale = 0.4,
cutoff = 0.25, color.node = color.mixo(5:6),
save = "png", name.save = "network.comp2")
When applying cim() to integration methods (two omics), the mapping parameter controls what association matrix is used for graphing. By default, it will use the combined association matrix (mapping = “XY”). This means that this time each cell in the heat map represents the correlation between a feature from each ’omic dataset. If mapping = “X” or mapping = “Y” is used, each cell represents the raw expression data from the input X or Y dataset respectively.
cim(shrink.rcc.mRNA.protein, cutoff = 0.4, comp = 1:2,
xlab = "Protein", ylab = "mRNA",
col.cex = 0.5, row.cex=0.5, margins=c(6,5),
save = "png", name.save = "rcc.mRNA.protein.cim")
In the previous session we used PLSDA on a single-omic dataset, which found latent components (built from a linear combination of key features) that could drive apart the experimental groups of samples (categorical outcome).
‘Partial Least Squares’ or ‘Projection to Latent Structures’ (PLS) regression aims to find predictive features of a continuous outcome in another dataset, e.g. in this example data perhaps there are gene expression patterns that track very well with the expression of certain proteins, or miRNA that track well with the mRNA that they regulate.
Types
Modes
# start with a basic sPLS model
spls.miRNA.RNA <- spls(X = X$miRNA, Y = X$mRNA, ncomp = 5, mode = 'regression')
First, we need to optimise the number of components in the model.
# repeated CV tuning of component count
perf.spls.miRNA.RNA <- perf(spls.miRNA.RNA, validation = 'Mfold',
folds = 6, nrepeat = 5)
plot(perf.spls.miRNA.RNA, criterion = 'Q2.total')
# set range of test values for number of variables to use from X dataframe
list.keepX <- 2^(3:7)
list.keepX
## [1] 8 16 32 64 128
# set range of test values for number of variables to use from Y dataframe
# we may be motivated to have keepY low, so set a lower test range
list.keepY <- c(3:10)
tune.spls.miRNA.RNA <- tune.spls(X$miRNA, X$mRNA, ncomp = 2,
test.keepX = list.keepX,
test.keepY = list.keepY,
nrepeat = 1, folds = 6,
mode = 'regression', measure = 'cor')
plot(tune.spls.miRNA.RNA) # use the correlation measure for tuning
# set range of test values for number of variables to use from X dataframe
list.keepX <- round(2^(seq(2,6,0.5)))
list.keepX
## [1] 4 6 8 11 16 23 32 45 64
# set range of test values for number of variables to use from Y dataframe
# we may be motivated to have keepY low, so set a lower test range
list.keepY <- c(8:12)
tune.spls.miRNA.RNA <- tune.spls(X$miRNA, X$mRNA, ncomp = 2,
test.keepX = list.keepX,
test.keepY = list.keepY,
nrepeat = 1, folds = 6,
mode = 'regression', measure = 'cor')
plot(tune.spls.miRNA.RNA) # use the correlation measure for tuning
We can take the optimal values from these test results, and make a more better model.
# extract optimal number of variables for X dataframe
optimal.keepX <- tune.spls.miRNA.RNA$choice.keepX
# extract optimal number of variables for Y datafram
optimal.keepY <- tune.spls.miRNA.RNA$choice.keepY
optimal.ncomp <- length(optimal.keepX) # extract optimal number of components
optimal.keepX
## comp1 comp2
## 6 6
optimal.keepY
## comp1 comp2
## 11 9
optimal.ncomp
## [1] 2
Final model:
# use all tuned values from above
final.spls.miRNA.RNA <- spls(X$miRNA, X$mRNA, ncomp = optimal.ncomp,
keepX = optimal.keepX,
keepY = optimal.keepY,
mode = "regression") # explanatory approach being used,
# hence use regression mode
sPLS is unsupervised, but we’ll plot the colour-coded group information as previously.
plotIndiv(final.spls.miRNA.RNA,
ind.names = FALSE, group = Y, legend=TRUE,
legend.title = "Cancer\nSubtype")
# plot with rep.space = "XY-variate" (averaged component subspace)
plotIndiv(final.spls.miRNA.RNA, rep.space = "XY-variate",
ind.names = FALSE, group = Y, legend=TRUE, title = "sPLS miRNA:mRNA",
legend.title = "Cancer\nSubtype")
The length of arrows will display the level of agreement between datasets.
plotArrow(final.spls.miRNA.RNA, rep.space = "XY-variate",
ind.names = FALSE, group = Y, legend=TRUE, title = "sPLS miRNA:mRNA",
legend.title = "Cancer\nSubtype")
“Stability” of feature selection should be inspected. We can use the perf() function.
# form new perf() object which utilises the final model
perf.spls.miRNA.RNA <- perf(final.spls.miRNA.RNA,
folds = 6, nrepeat = 30, # use repeated cross-validation
validation = "Mfold", cpus=3,
dist = "max.dist", # use max.dist measure
progressBar = FALSE)
## Warning: The SGCCA algorithm did not converge
# plot the stability of each feature for the first two components,
# 'b' type refers to plotting histogram
par(mfrow=c(1,2))
plot(perf.spls.miRNA.RNA$features$stability.X$comp1, type = 'h',
ylab = 'Stability',
xlab = 'Features',
main = 'Comp 1', las =2,
ylim = c(0,1), xlim = c(0, 30))
plot(perf.spls.miRNA.RNA$features$stability.X$comp2, type = 'h',
ylab = 'Stability',
xlab = 'Features',
main = 'Comp 2', las =2,
ylim = c(0,1), xlim = c(0, 30))
plotVar(final.spls.miRNA.RNA, cex = c(2,3), var.names = c(FALSE, TRUE), cutoff = 0.6)
color.edge <- color.GreenRed(50) # set the colours of the connecting lines
network(final.spls.miRNA.RNA, comp = 1:2,
cutoff = 0.6, # only show connections with a correlation above 0.6
shape.node = c("rectangle", "circle"),
color.node = color.mixo(4:5),
size.node = 0.3,
color.edge = color.edge,
save = 'png', # save as a png to the current working directory
name.save = 'sPLS.miRNA.mRNA.network')
cim(final.spls.miRNA.RNA, comp = 1:2, xlab = "mRNA", ylab = "miRNA",
margins = c(5,8),
cutoff = 0.6, save = "png", name.save = "sPLS.miRNA.mRNA.cim")